
[Closed] Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache)#962

Closed
AnirudhRahul wants to merge 2 commits into openai:main from AnirudhRahul:record/low-eval-memory-no-phrase-00214

Conversation

AnirudhRahul commented Mar 27, 2026

Summary

  • Supersedes #931 (Record: 0.0498 bpb - Packed Training N-gram Artifact + Learned Weighting Gate, updated) for this line of work with the final low eval-time memory regime: the packed order-2..9 training n-gram artifact and the learned gate remain, but the logistic context mixer and the long phrase cache are removed from the final eval path.
  • Final 3-seed mean val_bpb is 0.02139943 +/- 0.00003918; worst-case total submission size is 15,881,331 bytes and worst-case eval time is 437s.
  • All reported runs stay within budget: training <600s, eval <600s, artifact <16MB.
  • The main auxiliary eval-time state is the fixed 2 MiB order-2..9 n-gram cache: 32K buckets with two uint32 count tables per order. This is the primary persisted state beyond the transformer itself and it does not grow with validation length.
  • The submission keeps the compliant causal path: the n-gram cache persisted from training time is included as part of the artifact itself, expert availability is context-only, GPTQ calibration uses cached training batches, the output distribution is normalized to sum to 1 for each token, and the reported path uses TTT_EPOCHS=0.
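The 2 MiB figure for the auxiliary cache follows directly from the stated layout. A quick back-of-the-envelope check, assuming the layout described above (8 orders from 2 to 9, 32,768 hash buckets, two uint32 count tables per order):

```python
# Size check for the packed n-gram cache layout described above.
# Assumes: orders 2..9 (8 orders), 32,768 hash buckets per order,
# two uint32 (4-byte) count tables per order.
NUM_ORDERS = 9 - 2 + 1      # orders 2..9
BUCKETS = 32_768            # matches NGRAM_EVAL_BUCKETS in the repro command
TABLES_PER_ORDER = 2        # two uint32 count tables per order
BYTES_PER_ENTRY = 4         # uint32

total_bytes = NUM_ORDERS * BUCKETS * TABLES_PER_ORDER * BYTES_PER_ENTRY
print(total_bytes, total_bytes / 2**20)  # 2097152 bytes = exactly 2.0 MiB
```

Because the cache is a fixed array of hash buckets, this footprint is constant regardless of how many validation tokens are streamed through it.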

Results

| Seed | Final val_bpb | Artifact bytes | Total bytes | Eval time |
| ---- | ------------- | -------------- | ----------- | --------- |
| 1337 | 0.02144330 | 15,015,946 | 15,179,538 | 432s |
| 42 | 0.02136791 | 15,717,739 | 15,881,331 | 433s |
| 7 | 0.02138708 | 15,083,362 | 15,246,954 | 437s |

3-seed mean val_bpb: 0.02139943

Sample std: 0.00003918
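The reported mean and spread can be recomputed from the per-seed values in the table above (the quoted figure is the sample standard deviation, i.e. with an n-1 denominator):

```python
# Recompute the 3-seed statistics from the per-seed val_bpb values above.
import statistics

val_bpb = [0.02144330, 0.02136791, 0.02138708]  # seeds 1337, 42, 7
mean = statistics.mean(val_bpb)
std = statistics.stdev(val_bpb)  # sample std (n-1 denominator)
print(f"{mean:.8f} +/- {std:.8f}")  # 0.02139943 +/- 0.00003918
```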

Causal Inference Scheme

  1. Deserialize the packed order-2..9 n-gram cache from the submitted artifact at eval start.
  2. Score each validation chunk once using only left context and the current cache state.
  3. Query n-gram experts using left context only; the learned gate's expert-availability mask depends only on context evidence.
  4. Blend neural + n-gram experts, then renormalize the full-vocabulary distribution so it sums to 1 before scoring.
  5. Update the streaming n-gram cache only after the chunk has already been scored.
  6. Report the final single-pass path with TTT_EPOCHS=0.
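The score-then-update ordering in steps 2 and 5 is the core of the causality argument. A minimal sketch of that loop, where `score_chunk`, `update_cache`, and the chunk iterator are hypothetical stand-ins for the submission's actual functions, shown only to make the ordering explicit:

```python
# Sketch of the single-pass eval loop: every chunk is scored with the cache
# state that existed *before* that chunk, and the streaming cache is folded
# forward only afterwards. Function names here are illustrative stand-ins.

def evaluate_single_pass(chunks, cache, score_chunk, update_cache):
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        # 1) Score using only left context and the current cache state.
        bits, n_tokens = score_chunk(chunk, cache)
        total_bits += bits
        total_tokens += n_tokens
        # 2) Only now fold the already-scored chunk into the cache.
        update_cache(cache, chunk)
    return total_bits / total_tokens  # per-token bits, as defined upstream
```

At eval step 0 the cache is not empty: it holds the warm-start state deserialized from the artifact, which was built from training data only.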

Compliance

  • This is not a 2-pass method.
  • Validation is scored in a single causal pass: each chunk is scored before that chunk is used for any cache update.
  • The warm-start cache used at eval step 0 is part of the artifact itself, not a separate runtime input.
  • The n-gram cache persisted from training time is included as part of the artifact and deserialized at eval start.
  • The packed n-gram cache in the artifact is derived from training data only and is produced within the 600 second training budget.
  • The learned gate does not use the true next token to decide which experts are available.
  • GPTQ calibration runs inside the reserved pre-export budget using cached training batches from the same timed run; it does not reopen training shards after the official wallclock limit.
  • The output distribution is normalized to sum to 1 for each token before likelihood is accumulated.
  • The current reported numbers use TTT_EPOCHS=0, so there is no backward test-time adaptation in the final submission path.
  • No future validation tokens are visible when scoring the current chunk.
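The normalization bullet corresponds to step 4 of the inference scheme (and to `RENORMALIZE_FINAL_PROBS=1` in the reproduction command). A minimal sketch of a blend-then-renormalize step; the gate weight and the toy distributions are illustrative stand-ins, and the real path operates on full-vocabulary tensors per token position:

```python
# Sketch of blending a neural distribution with an n-gram expert distribution
# and renormalizing so the result sums to 1 before likelihood accumulation.
# gate_weight in [0, 1] stands in for the learned gate's output.

def blend_and_renormalize(neural_probs, ngram_probs, gate_weight):
    """Convex per-token blend of two distributions, renormalized to sum to 1."""
    mixed = [gate_weight * q + (1.0 - gate_weight) * p
             for p, q in zip(neural_probs, ngram_probs)]
    z = sum(mixed)  # guards against drift from clipped or quantized experts
    return [m / z for m in mixed]
```

Renormalizing after the blend guarantees the accumulated likelihood is computed over a proper probability distribution, which is the property `VERIFY_FINAL_PROBS=1` asserts at runtime.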

Reproduction

pip install -r records/track_10min_16mb/2026-03-27_LowEvalMemoryRegime_PackedTrainCache_NoMixer/requirements.txt

cd records/track_10min_16mb/2026-03-27_LowEvalMemoryRegime_PackedTrainCache_NoMixer

SEED=1337 \
DATA_PATH=/root/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/root/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
ARTIFACT_NGRAM_EXPORT=1 \
MAX_WALLCLOCK_SECONDS=600 \
VAL_LOSS_EVERY=0 \
USE_MIXER=0 USE_PHRASE_CACHE=0 MIXER_HEAD=multi \
USE_NGRAM_CACHE=1 NGRAM_EVAL_ORDER=9 \
TRAIN_ORACLE_BUCKETS=32768 NGRAM_EVAL_BUCKETS=32768 \
USE_REGIME_TRACKER=0 USE_LOGIT_CAL=1 \
TTT_EPOCHS=0 TTT_FREEZE_BLOCKS=2 TTT_LR=0.0001 \
TTT_CHUNK_TOKENS=131072 SKIP_SLIDING=1 EVAL_STRIDE=64 TTT_TEMPERATURE=0.85 \
CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.05 BIGRAM_VOCAB_SIZE=0 \
GPTQ_CALIBRATION_SEQS=128 \
RENORMALIZE_FINAL_PROBS=1 VERIFY_FINAL_PROBS=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Submission Checklist

  • One new folder added under records/track_10min_16mb
  • README.md included
  • submission.json included
  • train_gpt.py included
  • Train logs included (train_seed1337.log, train_seed42.log, train_seed7.log)
  • Train and eval under 10 minutes
  • Artifact under 16MB
  • No tokenizer/dataset edits
  • Score-first ordering preserved (no hindsight path)

This updates the packed training n-gram artifact submission with the final no-mixer, no-phrase 3-seed reruns and documents the causal single-pass evaluation path.

Made-with: Cursor
AnirudhRahul (Author) commented:

This submission attempts to address the concerns about memory usage from n-gram caches (flagged as potentially problematic in #886): it uses only an extra 2 MiB of memory to track its n-gram state.

This replaces the prior point-scored results with renormalized 3-seed reruns so the final output distribution sums to 1 at every token and the published BPB reflects the normalized path.

Made-with: Cursor
@AnirudhRahul AnirudhRahul changed the title Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache) [Closed] Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache) Mar 27, 2026
AnirudhRahul (Author) commented:

#677 (comment)
